Background. For diagnostic processes involving continual measurements
from a single patient, conventional test characteristics, such as
sensitivity and specificity, do not consider decision consistency, which
might be a distinct, clinically relevant test characteristic. Objective.
The authors investigated the performance of a decision-support classifier
for the diagnosis of traumatic injury with blood loss, implemented with
three different data-processing methods. For each method, they computed
standard diagnostic test characteristics and novel metrics related to
decision consistency and latency. Setting. Prehospital air ambulance
transport. Patients. A total of 557 trauma patients. Design. Continually
monitored vital-sign data from 279 patients (50%) were randomly selected
for classifier development, and the remaining 278 patients were used for
testing. Three data-processing methods were evaluated over 16 min of
patient monitoring: a 2-min moving window, time averaging, and
postprocessing with the sequential probability ratio test (SPRT).
Measurements. Sensitivity and specificity were computed. Consistency was
quantified through cumulative counts of decision changes over time and
the fraction of patients affected by false alarms. Latency was evaluated
by the fraction of patients without a decision. Results. All 3 methods
showed very similar final sensitivities and specificities. Yet, there
were significant differences in terms of the fraction of patients
affected by false alarms, decision changes through time, and latency. For
instance, use of the SPRT led to a 75% reduction in the number of
decision changes and a 36% reduction in the number of patients affected
by false alarms, at the expense of 3% unresolved final decisions.
Conclusion. The proposed metrics of decision consistency and decision
latency provided additional information beyond what could be obtained
from test sensitivity and specificity and are likely to be clinically
relevant in some applications involving temporal decision making. Key
words: continual patient monitoring; decision-support algorithm;
sequential probability ratio test; physiological data.

Continual physiological monitoring is standard practice in many health
care arenas, e.g., hospital wards and operating rooms, where vital-sign
data are measured repeatedly so that if instability occurs it can be
detected and treated promptly. However, false alarms are a major problem
because standard alarms are triggered when certain parameter thresholds
are reached.1-3 All too often, the abnormality that triggers an alarm is
either a measurement artifact or a benign transient event. Yet, when
false alarms occur frequently, there is a deleterious effect on patients
in that caregivers may be slow to respond to alarms with low positive
predictive value.4

CHEN AND OTHERS


In this report, we considered a set of analytic methods for detecting
abnormalities from continual physiological data and examined how the
techniques compared through time. We examined whether standard test
characteristics (sensitivity and specificity) were adequate for
describing the resultant alarm behaviors from one time interval to the
next. Specifically, we developed metrics to measure the temporal
stability of test decisions, which we termed consistency, and examined
the extent to which patient alarms were consistent through time. Our
intent was to describe whether alarms tended to recur in the same
patients from one time period to the next (on whom the clinical staff
would be able to focus attention) or if (false) alarms were distributed
throughout the entire monitored population (so that many disparate
patients would unnecessarily require attention as the alarms were
triggered).

We focused on several basic methods for pre- and postprocessing of
continual vital-sign data into and out of a core alarm algorithm.
Analytic methods for identifying irregularities from a set of time-series
data have been well established in the manufacturing quality control
literature. Methods dealing with this problem include the sequential
probability ratio test (SPRT),5,6 which evaluates the likelihood ratio of
2 hypotheses based on sequentially available evidence. Alternatives
include the control chart method,7,8 the belief-modeling method,9 and
other Bayesian-based methods.10,11 Among these methods, the SPRT is simple
to calculate and, for given false-positive and false-negative
probabilities, requires the smallest number of samples to achieve a
decision (unless the statistical model is grossly incorrect).5

Our goal was to investigate if conventional test characteristics were
adequate for assessing the basic performance of an alarm or if it was
also necessary to consider its temporal consistency. In a comparative
analysis, we employed 3 methods for pre- and postprocessing of continual
data into and out of our core alarm algorithm. Compared with a 2-min
moving window, we examined if additional time averaging and the SPRT
could alter the overall accuracy, the temporal consistency, and the
latency of the algorithm output. The core alarm algorithm was a
multivariate classifier for the diagnosis of traumatic injury with blood
loss, given data from a standard prehospital patient monitor.12 This
analysis has implications for any diagnostic process involving continual
vital-sign measurements from a single patient.


226  •  MEDICAL  DECISION  MAKING/FEB  2013 


METHODS 

This is a retrospective, comparative analysis, based on a previously
reported ensemble classifier,12 which provides automated detection of
traumatic injury with blood loss in prehospital patients based on basic
vital signs. We simulated 3 methods to process real-time data during the
initial 16 min of prehospital patient transportation. The moving window
method involved a moving 2-min analysis window; at every moment in time,
the classifier was applied to the most recent 2 min of physiological
data. The time-averaging method analyzed all available data from a given
patient, from the onset of the data record to the current time (up to a
maximum of 16 min). In the SPRT method, we applied the SPRT to the output
of the classifier.

Trauma  Patient  Data 

The physiological time-series data were collected from 643 trauma-injured
patients during their first 16 min of helicopter transport to a trauma
center.13 The time-series variables were measured by ProPaq 206EL
vital-sign monitors (Protocol Systems, Beaverton, OR) and consisted of
electrocardiogram, photoplethysmogram, and respiratory waveform signals
recorded at various frequencies and their corresponding
monitor-calculated vital signs, including heart rate (HR), respiratory
rate (RR), and arterial oxygen saturation (SaO2), recorded at 1-s
intervals, and systolic (SBP), mean, and diastolic (DBP) blood pressures
collected intermittently at multiminute intervals.

We performed chart reviews to determine whether the transported trauma
patients had traumatic injury with blood loss. Traumatic injury with
blood loss was defined as requirement of a blood transfusion within 24 h
of arrival at the trauma center and also documentation of an explicitly
hemorrhagic injury, either a) laceration of solid organs, b) thoracic or
abdominal hematomas, c) explicit vascular injury and operative repair, or
d) limb amputation. Patients who received blood but did not meet the
documented injury criteria (60 cases) and patients who died before
arrival at the hospital (26 cases) were excluded from the analyses
because of uncertainty about whether they truly suffered traumatic injury
with blood loss. Thus, we used a total of 557 patients, of whom 61 were
categorized as patients with traumatic injury and blood loss and the
remaining 496 as controls.

Decision-Support Classifier: Training

The ensemble classifier aggregates 25 least-squares linear classifiers,
each trained with a different subset of 5 input variables (HR, RR, SaO2,
SBP, and DBP) and with target values of 0.0 and 1.0, standing for control
and traumatic injury with blood loss outcomes, respectively, to generate
an (arithmetic) average output that can be used to discriminate the 2
outcomes.12 We assigned ensemble-averaged outputs of < 0.5 as control
outcomes and outputs of > 0.5 as traumatic injury with blood loss. The
ensemble classifier has been shown to provide more consistent performance
than a single linear classifier, and importantly, it accommodates missing
data, providing an output as long as any 1 of the 5 inputs is
available.12
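
As a rough illustration of the ensemble idea described above, the following sketch fits one least-squares linear model per input subset and averages the members whose inputs are all present. It is not the published classifier: the choice of subsets (here, all subsets of at most 3 of the 5 inputs, which happens to give 25 members), the intercept term, and the function names `make_subsets`, `fit_ensemble`, and `predict` are assumptions made for illustration only.

```python
import itertools
import numpy as np

FEATURES = ["HR", "RR", "SaO2", "SBP", "DBP"]


def make_subsets(n_members=25):
    """Hypothetical subset choice: non-empty subsets ordered by size (sizes 1-3 give 25)."""
    subsets = [c for r in range(1, len(FEATURES) + 1)
               for c in itertools.combinations(range(len(FEATURES)), r)]
    return subsets[:n_members]


def fit_member(X, y, cols):
    """Least-squares fit (with intercept) on the rows where all chosen inputs exist."""
    Xs = X[:, list(cols)]
    ok = ~np.isnan(Xs).any(axis=1)
    A = np.column_stack([Xs[ok], np.ones(ok.sum())])
    w, *_ = np.linalg.lstsq(A, y[ok], rcond=None)
    return w


def fit_ensemble(X, y, subsets):
    """Train one linear member per subset; targets are 0.0 (control) and 1.0 (injury)."""
    return [(cols, fit_member(X, y, cols)) for cols in subsets]


def predict(ensemble, x):
    """Average the outputs of all members whose inputs are available; None if none are."""
    outputs = []
    for cols, w in ensemble:
        xs = x[list(cols)]
        if np.isnan(xs).any():
            continue  # this member needs a missing vital sign
        outputs.append(float(xs @ w[:-1] + w[-1]))
    return float(np.mean(outputs)) if outputs else None
```

Because every single-input member is included in this assumed subset list, the sketch returns an output whenever at least one vital sign is available, mirroring the missing-data behavior described in the text. An averaged output above 0.5 would be read as traumatic injury with blood loss.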

We randomly selected 50% of the study population (279 patients; 248
controls and 31 patients with traumatic injury and blood loss) to train
(i.e., develop) the classifier. In prior studies,14 we found that
prehospital vital-sign data are very noisy, and hence, we developed
algorithms that automatically assess the reliability of each vital sign
used as input to the classifier.15-17 We also reported that reliable data
are diagnostically superior to unreliable data.15,18 In another study,14
we found that there are no major time-series trends in these vital-sign
data, and averaging the most reliable data during transport yielded the
best discriminatory performance. Consequently, we used the average value
of the most reliable training data points from the first 16 min of
transport time as input to train the ensemble classifier.

Evaluation  of  the  Moving  Window,  Time-Averaging, 
and  SPRT  Methods 

We investigated 3 methods to pre- and postprocess the ensemble classifier
data. In each method, 1) the first 2 min of transport vital-sign data
were used as a buffer where no classifications were made; 2) every 1 s we
averaged the most reliable available vital-sign data (HR, RR, etc.) over
a specified time window, input the averaged values to the classifier, and
obtained an output; and 3) every 15 s, we averaged the previous 15
classifier outputs to generate a decision. The 3 methods differed on the
length of the preprocessing time window of the classifier input data in
item 2 (above) and on any additional postprocessing of the classifier
outputs in item 3.

For the moving window, we averaged the classifier inputs over a 2-min
time window and compared the averaged classifier output every 15 s with a
0.5 threshold. The time-averaging method differed from the first method
in that the length of the time window for averaging the vital-sign input
data grew continually up to the current decision time so that all
available data were considered for each decision. In the SPRT method, the
classifier outputs were further processed as inputs to the SPRT to
generate a SPRT decision (or no decision), as described below.
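
The difference between the moving window and time-averaging methods can be sketched for a single 1-Hz signal as follows. This is a simplified illustration under stated assumptions, not the study code: it handles one input variable rather than five, treats unreliable samples as NaN, and the names `window_mean` and `decisions` and the `classify` callable are hypothetical.

```python
import numpy as np

BUFFER_S = 120    # first 2 min: buffer, no classifications
DECISION_S = 15   # one decision every 15 s


def window_mean(signal, t, window_s=None):
    """Mean of the 1-Hz samples before second t.

    window_s=120 gives the 2-min moving window; window_s=None averages
    everything since the start of the record (time averaging).
    """
    start = 0 if window_s is None else max(0, t - window_s)
    return float(np.nanmean(signal[start:t]))


def decisions(signal, classify, window_s=None, threshold=0.5):
    """Every second: classify the windowed mean. Every 15 s: average the last 15 outputs."""
    outputs, result = [], []
    for t in range(BUFFER_S + 1, len(signal) + 1):
        outputs.append(classify(window_mean(signal, t, window_s)))
        if (t - BUFFER_S) % DECISION_S == 0:
            score = float(np.mean(outputs[-DECISION_S:]))
            result.append(1 if score > threshold else 0)
    return result
```

With a toy signal that steps from 0 to 1 halfway through a 16-min record and an identity `classify`, the 2-min window ends on a positive decision while the growing window is still negative at 16 min, which illustrates the stability-versus-latency tradeoff of longer averaging.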

The  Sequential  Probability  Ratio  Test 

We investigated the ability of Wald’s SPRT5,6 to consider the sequential
nature and postprocess the outputs of the ensemble classifier while
balancing decision accuracy, consistency, and latency. Given a sequence
of outputs $Y_1, Y_2, \ldots$, not necessarily independent, from the
ensemble classifier, so that $Y \sim N(\mu_Y, \sigma_Y^2)$ is a Gaussian
process with an unknown mean $\mu_Y$ and a known constant variance
$\sigma_Y^2$, the SPRT classifies a patient as control or traumatic
injury with blood loss, or makes no decision, based on hypothesis
testing. Note that $\sigma_Y^2$ was estimated as the variance of the
ensemble classifier outputs at the end of the transport, i.e., at 16 min,
and was kept fixed throughout the analysis. The SPRT tests a null
hypothesis ($H_0$) that $\mu_Y = \mu_0$ against an alternative hypothesis
($H_1$) that $\mu_Y = \mu_1$, where $\mu_0$ and $\mu_1$ denote the
arithmetic mean values of the classifier outputs for the control and
traumatic injury with blood loss cases, respectively, with
$\mu_0 < \mu_1$. If we let $p_0$ and $p_1$ be the probability density
functions governing the two hypotheses, $H_0$ and $H_1$, respectively,
then the observed likelihood ratio at decision time $J$ can be
represented as $l_J = \prod_{j=1}^{J} p_1(Y_j)/p_0(Y_j)$, with
$J = 1, 2, \ldots$.

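According to Wald’s SPRT methodology,5 we

$$
\begin{aligned}
&\text{accept } H_0 \text{ (control), if } \log(l_J) \le B; \text{ or} \\
&\text{accept } H_1 \text{ (traumatic injury with blood loss), if } \log(l_J) \ge A; \text{ or} \\
&\text{continue to decision time } J + 1, \text{ if } B < \log(l_J) < A,
\end{aligned}
\tag{1}
$$

where $A$ and $B$ are constants, with $-\infty < B < 0 < A < \infty$,
chosen using Wald’s criteria,5 so as to yield nominal false-positive
probability ($\alpha$; $0.0 < \alpha < 0.5$) and nominal false-negative
probability ($\beta$; $0.0 < \beta < 0.5$) as follows:

$$
A = \log\frac{1-\beta}{\alpha}, \quad \text{and} \quad B = \log\frac{\beta}{1-\alpha}. \tag{2}
$$
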
When $\alpha$ and $\beta$ are relatively small (e.g., < 0.05), the SPRT
tends to delay making a decision until additional corroborating
classifier outputs become available. Conversely, when $\alpha$ and
$\beta$ are large (e.g., ≈ 0.5), the SPRT makes quicker, albeit less
accurate, decisions. Thus, by appropriately selecting these two
parameters, we can balance decision accuracy, consistency, and delay. To
this end, we determined the nominal probabilities $\alpha$ and $\beta$ by
minimizing a cost function $\phi$, which linearly combined the accuracy
of the decisions, defined by its sensitivity ($Se$) and specificity
($Sp$), at the end of the transport (i.e., at 16 min); the cumulative
incidences of decision changes ($D_c$; from control to traumatic injury
with blood loss and vice versa) over the 16 min of transport time; and
the fraction of patients with no decision ($N_d$) at the end of the
transport. Accordingly, we defined $\phi$ as follows:

$$
\phi = \frac{1 - Se}{0.05} + \frac{1 - Sp}{0.05} + \frac{D_c}{10} + \frac{N_d}{0.01}, \tag{3}
$$

where the rescaling factors of the summands were empirically obtained
through SPRT trial simulations on the training data so as to normalize
the effect of each of the four summands on $\phi$.
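
The cost function in equation 3 is a plain weighted sum and can be written down directly. The sketch below also shows one possible way to pick the nominal probabilities by exhaustive search over a grid. The grid search itself and the `evaluate` callable, which is assumed to run the SPRT on training data and return the four quantities, are illustrative assumptions; the text does not state how the minimization was carried out.

```python
def sprt_cost(se, sp, dc, nd):
    """Cost of equation 3: se and sp are fractions, dc is a count, nd is a fraction."""
    return (1 - se) / 0.05 + (1 - sp) / 0.05 + dc / 10 + nd / 0.01


def choose_alpha_beta(evaluate, grid):
    """Return the (alpha, beta) pair on the grid with the lowest cost.

    `evaluate(alpha, beta)` is a hypothetical callable returning (se, sp, dc, nd)
    from an SPRT simulation on training data.
    """
    pairs = ((a, b) for a in grid for b in grid)
    return min(pairs, key=lambda ab: sprt_cost(*evaluate(*ab)))
```

For example, `sprt_cost(0.80, 0.73, 29, 0.03)` evaluates to roughly 15.3, with the four summands contributing 4.0, 5.4, 2.9, and 3.0, which shows how the rescaling factors bring the terms to comparable magnitudes.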

Under the Gaussian model, the log-likelihood ratio $\log(l_J)$ in
equation 1 can be recursively calculated as follows:

$$
\log(l_{J+1}) = \log(l_J) + \frac{\mu_1 - \mu_0}{\sigma_Y^2}\left(Y_{J+1} - \frac{\mu_0 + \mu_1}{2}\right), \quad J = 0, 1, 2, \ldots, \tag{4}
$$

where the initial log-likelihood ratio $\log(l_0)$ can be selected
arbitrarily and was set to 0.0 in this study. While it has been shown
that the SPRT achieves a selected confidence in the shortest decision
time,5 it may not always arrive at a decision. However, when a decision
was made, we noted the decision, stuck to it, and restarted the SPRT
process from that time point until a new decision emerged.
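
Equations 1, 2, and 4 together with the restart rule can be sketched in a few lines. This is one reading of the procedure described above rather than the authors' implementation: in particular, resetting the log-likelihood ratio to 0.0 after each decision and holding the last decision until a new one emerges are interpretations, and the function names are hypothetical.

```python
import math


def wald_thresholds(alpha, beta):
    """Log-scale thresholds of equation 2 for nominal error probabilities."""
    upper = math.log((1 - beta) / alpha)   # A: accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # B: accept H0 at or below this
    return upper, lower


def sprt_decisions(outputs, mu0, mu1, var, alpha, beta):
    """Run the Gaussian SPRT over classifier outputs.

    Returns one entry per output: 1 (injury with blood loss), 0 (control),
    or None while no decision has been reached yet.
    """
    upper, lower = wald_thresholds(alpha, beta)
    log_l = 0.0        # log(l_0) = 0.0, as in the study
    current = None     # last decision; kept until a new one emerges
    history = []
    for y in outputs:
        # Equation 4: recursive update of the log-likelihood ratio.
        log_l += (mu1 - mu0) / var * (y - (mu0 + mu1) / 2)
        if log_l >= upper:
            current, log_l = 1, 0.0   # decide H1, then restart (assumed reset to 0)
        elif log_l <= lower:
            current, log_l = 0, 0.0   # decide H0, then restart (assumed reset to 0)
        history.append(current)
    return history
```

With `mu0=0.3`, `mu1=0.7`, `var=0.04`, and `alpha=beta=0.05`, the thresholds are about ±2.94, so a single output of 0.7 (an increment of about 2.0) is not enough for a decision, but two in a row are.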


Investigational  Metrics 

We compared the performance of the 3 data-processing methods using
testing data from 278 patients where we evaluated the accuracy, latency,
and consistency (in a sense to be defined) of the methods in aggregate
using the following 5 performance metrics:

1. Sensitivity: at any given time t, the fraction of patients with
traumatic injury and blood loss who were correctly identified by the
algorithm at time t;

2. Specificity: at any given time t, the fraction of control patients who
were correctly identified by the algorithm at time t;

3. No decisions: at any given time t, the fraction of patients without a
decision out of the total number of patients;

4. Cumulative decision changes: the cumulative count up through time t of
decision changes Dc; and

5. False-alarm-affected patients: the fraction of control patients
incorrectly identified as having traumatic injury with blood loss, at or
before time t, out of the total number of patients.
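
The 5 metrics above can be computed from a table of per-patient decision sequences. The sketch below assumes each patient's decisions are stored as a list with one entry per 15-s step (1 for injury with blood loss, 0 for control, `None` for no decision) and that `labels` holds the chart-review outcome; these data layouts and function names are assumptions. It also assumes that a patient without a decision counts as not correctly identified and that a first decision after `None` is not a decision change.

```python
def sensitivity_at(decisions, labels, k):
    """Fraction of true cases classified as injury at step k (no decision = missed)."""
    cases = [d for d, y in zip(decisions, labels) if y == 1]
    return sum(d[k] == 1 for d in cases) / len(cases)


def specificity_at(decisions, labels, k):
    """Fraction of controls classified as control at step k."""
    controls = [d for d, y in zip(decisions, labels) if y == 0]
    return sum(d[k] == 0 for d in controls) / len(controls)


def no_decision_fraction(decisions, k):
    """Fraction of all patients without a decision at step k."""
    return sum(d[k] is None for d in decisions) / len(decisions)


def cumulative_decision_changes(decisions, k):
    """Total 0->1 and 1->0 flips across all patients, up through step k."""
    total = 0
    for d in decisions:
        made = [x for x in d[:k + 1] if x is not None]
        total += sum(a != b for a, b in zip(made, made[1:]))
    return total


def false_alarm_affected(decisions, labels, k):
    """Controls with any injury decision at or before step k, over all patients."""
    affected = sum(1 for d, y in zip(decisions, labels)
                   if y == 0 and 1 in d[:k + 1])
    return affected / len(decisions)
```

Note that the last metric can only grow over time, because a control patient who has triggered one false alarm stays counted even if every later decision is correct.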

Every 2 min, from 2 to 16 min of transport time, we performed statistical
tests of significance with pairwise comparisons between the
investigational methods (i.e., moving window, time averaging, and SPRT).
For proportions (sensitivity, specificity, no decisions, and
false-alarm-affected patients), we employed Liddell’s exact test.19 The
counts of total decision changes throughout the population cannot be
statistically evaluated, so we also computed the total decision changes
per subject and applied the Wilcoxon signed-rank test to the
distributions. For all statistical tests, we considered a P value of
< 0.05 to be statistically significant.


RESULTS 

Figure 1 illustrates the continual output of the 3 data-processing
methods, the moving window, time-averaging, and SPRT methods, in
monitoring 4 control subjects (panel A) and 3 subjects with traumatic
injury and blood loss (panel B). Each tile in the figure represents a
15-s outcome decision, with red (or dark) representing traumatic injury
with blood loss decisions, green (or medium gray) control, and yellow (or
light gray) no decisions. The selected control subjects illustrate
different patterns in outcome decisions that we observed in the 248
control subjects in the testing data. For example, for subject 364, all 3
methods made correct and consistent control decisions over the 16-min
transport time. For subject 607, each method generated some
false-positive (i.e., false traumatic injury with blood loss) decisions.
However, the moving window method generated the largest number of
decision changes (3 changes from control to traumatic injury with blood
loss and 3 from traumatic injury with blood loss to control, for a total
of 6 decision changes), while the other 2 methods generated 2 decision
changes each. For the third subject (640), unlike the other 2 methods,
the SPRT method avoided making incorrect decisions (and decision
changes), but the decision was delayed by more than 4 min. Finally, for
subject 749, the SPRT was not able to make a definite decision during the
16-min transport time, while the other 2 methods generated decision
changes and mostly incorrect decisions.

Figure 1 Continual outcome decisions over the 16 min of transport time
for each of the 3 data-processing methods. (A) Selected pattern for 4
control subjects, and (B) 3 subjects with traumatic injury and blood
loss. Each tile represents a 15-s outcome decision, with red (or dark)
representing traumatic injury with blood loss decisions, green (or medium
gray) control, and yellow (or light gray) no decisions. SPRT, sequential
probability ratio test.

Panel B illustrates 3 patterns of decisions observed within the 31
patients in the testing set with traumatic injury and blood loss: for
subject 580, all methods generated a consistent decision; for subject 64,
the methods generated intermittent false-negative (i.e., false control)
decisions, with the moving window method yielding an incorrect decision
at 16 min; and for subject 46, all methods generated the correct final
decision, although the moving window produced decision changes and some
incorrect decisions, while the SPRT did not produce a decision until
almost 4 min.

Figure 2 illustrates the performance of the methods based on the 5
metrics (sensitivity, specificity, no decisions, cumulative decision
changes, and false-alarm-affected patients) used to evaluate the
accuracies, latencies, and consistencies of the methods for the 278
testing subjects over the 16-min transport time. Each of the 3 methods
(moving window, time averaging, and SPRT) yielded comparable performance
in terms of sensitivity and specificity at the end of the transport time
(sensitivity: 83%, 80%, and 80%, respectively; specificity: 71%, 75%, and
73%, respectively). Note that the SPRT method provided relatively low
sensitivity and specificity (< 60%) during the first 6 min of transport
because of a large fraction of patients without SPRT decisions (see panel
C). For instance, at 2 min, fewer than 25% of the patients had a decision
rendered by the SPRT, and consequently, the corresponding sensitivity was
also less than 25%. The SPRT method failed to make a decision at 16 min
for 8 subjects (or 3% of the subjects), while the other 2 methods showed
no decision latency (panel C).

Figure 2 Comparison of 3 data-processing decision methods for the 278
testing subjects analyzed over the 16-min transport time based on 5
performance metrics: (A) sensitivity, (B) specificity, (C) fraction of
patients with no decisions, (D) cumulative number of decision changes,
and (E) false-alarm-affected patients. Pairwise tests of significance
were performed every 2 min. Proportions were compared by Liddell’s exact
test (panels A-C, E). Panel D illustrates cumulative count of
total-population decision changes, and the Wilcoxon signed-rank test was
applied to the per patient counts of decision changes. *P < 0.05, time
averaging v. moving window. +P < 0.05, SPRT v. both moving window and
time averaging.


In terms of consistency of decisions, the SPRT demonstrated a
significantly reduced fraction of false-alarm-affected patients
throughout and at the end of the 16-min transport, compared with both
other methods: 27% of the subjects, which was 36% lower than the
time-averaging method (42% of the subjects) and 48% lower than the moving
window method (52% of the subjects; panel E). The SPRT also consistently
generated fewer decision changes over time (29 total decision changes v.
118 for the time-averaging method and 348 for the moving window method;
panel D).
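
The relative reductions quoted above follow directly from the reported percentages and counts. As a worked check, using only the figures given in the text:

```latex
\begin{align*}
\text{False-alarm-affected, SPRT v. time averaging:} \quad & \frac{42 - 27}{42} \approx 0.36 \\
\text{False-alarm-affected, SPRT v. moving window:} \quad & \frac{52 - 27}{52} \approx 0.48 \\
\text{Decision changes, SPRT v. time averaging:} \quad & \frac{118 - 29}{118} \approx 0.75 \\
\text{Decision changes, SPRT v. moving window:} \quad & \frac{348 - 29}{348} \approx 0.92
\end{align*}
```

The first three values match the 36%, 48%, and 75% reductions reported in the text and abstract; the last value is derived here from the reported counts and is not stated in the source.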

The time-averaging method was more consistent than the moving window
method, with significantly fewer false-alarm-affected patients and
average decision changes per patient. The time-averaging method did not
demonstrate the latency of the SPRT method.

DISCUSSION 

In this article, we studied the accuracy, consistency, and latency of a
decision-support classifier employing three different data-processing
methods for the continual prehospital diagnosis of traumatic injury with
blood loss in 557 trauma patients. It is striking that all methods showed
very similar sensitivities and specificities yet very different temporal
behaviors. For instance, Wald’s SPRT was much more consistent, generating
false alarms in significantly fewer patients, with significantly fewer
decision changes.

There are 2 major implications. First, for some continual monitoring
applications, standard test characteristics, e.g., sensitivity and
specificity, are insufficient for describing the performance of a
classifier because they do not describe if false alarms occur repeatedly
in a limited subpopulation or if false alarms are evenly distributed
throughout a population. Second, as a corollary, it is apparent that pre-
and postprocessing of time-series data can significantly alter temporal
consistency, as was seen in the application of time averaging and of the
SPRT, a classic technique intended for precisely this type of
application.

Insufficiency of standard test characteristics for describing the
continual performance of a classifier. For the continual monitoring of
patients, standard test characteristics do not consider the sequential
nature of the algorithm’s decisions when there are repeated decisions
being made on each subject. For example, while 2 binary decision
classifiers may have similar overall sensitivity and specificity, 1 may
be less stable than the other, continually “flipping” its decisions
through time (which is naturally exacerbated the more that a classifier
is sensitive to transient noise in the signal). We found this exact
phenomenon in our data set: After 5 to 10 min, the 3 investigational
methods had similar sensitivities and specificities, but there were
significant differences in the total number of patients affected by a
false alarm. Using the SPRT significantly reduced the fraction of
false-alarm-affected patients by approximately half, compared to the
moving window method.

We speculate that this effect was notable in this analysis because the
prehospital vital signs showed considerable intrasubject variability
through time, with sizable fluctuations in HR, blood pressure, etc.,
during the course of prehospital transport.14 Comparable fluctuations in
the prehospital vital signs of trauma patients have been observed in
other prehospital studies as well,20-22 which may be physiological
responses to episodic stimuli (e.g., pain and fear), to episodic
therapies (e.g., fluids), or to underlying pathology, as well as some
degree of routine biological variability and measurement error.

In general, are standard diagnostic test characteristics sufficient for
the assessment of continual patient monitoring, or is it appropriate to
quantify classifier consistency? It is likely that the frequency of
decision changes in diagnostic classification is dependent on the
classifier evaluation frequency, the temporal fluctuations in the
diagnostic data, and the proximity of the classifier output to the
decision boundary. Presumably, there is a continuum of diagnostic
applications in terms of the classifier consistency through time. If the
diagnostic data are temporally stable during intervals of disease and
health, then standard test characteristics are likely sufficient. At the
other extreme, if the diagnostic data fluctuate through time, then the
diagnostic classification will also fluctuate through time, and it may be
illuminating to consider metrics of consistency (as we have done in this
report) in addition to standard test characteristics. In many reports,
continual classifiers are evaluated without explicit consideration of
their performance and consistency through time, such as reports by our
group12 and by others.23-25 It is likely that, at least for a subset of
continual monitoring applications, standard diagnostic test
characteristics are insufficient and it would be valuable to consider
consistency to quantify clinically relevant properties of the diagnostic
test.

In  addition,  evaluating  a  temporal  classifier 
through  time  can  reveal  if  performance  changes 
because  of  temporal  disease  progression.  Presum¬ 
ably,  it  is  easier  to  diagnose  blood  loss  or  septic  shock 
as  the  pathology  progresses,  due  to  the  spectrum 
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effect  (e.g.,  when  a  diagnostic  test  performs  better  in 
a  study  population  with  more  severe  disease.  Con¬ 
sider  that  the  sensitivity  of  a  hypothetical  dip-test 
for  leukocyte  esterase  in  the  diagnosis  of  urinary  tract 
infection  may  be  higher  in  patients  of  an  underserved 
population,  who  tend  to  receive  evaluation  later  in 
the  course  of  disease,  rather  than  in  patients  of  an 
affluent  population,  who  are  promptly  evaluated 
after  the  earliest  symptoms).  Spectrum  effects  also 
affect  the  temporal  consistency  of  a  diagnostic  classi¬ 
fier,  because  small  fluctuations  in  diagnostic  data  for 
a  borderline  case  would  be  more  likely  to  affect  diag¬ 
nostic  classification  (e.g.,  during  early  stages  of  blood 
loss).  By  contrast,  cases  with  more  advanced  pathol¬ 
ogy  will  often  have  more  frankly  abnormal  diagnostic 
data,  and  so  temporal  fluctuations  are  unlikely  to 
alter  diagnostic  classification.  That  diagnostic  classi¬ 
fication  may  become  easier  as  the  disease  process 
progresses  is  often  well  recognized.  For  instance, 
Cuthbertson26  reported  test  characteristics  for  an 
investigative  early  warning  score  over  hourly  inter¬ 
vals,  e.g.,  1  h  prior  to  patient  acute  deterioration,  2  h 
prior,  etc.  However,  it  was  not  reported  to  what  extent 
the  true  and  false  alarms  occurred  in  the  same  patients 
hour  by  hour,  i.e.,  consistency.  In  this  report,  we 
describe  the  minute-by-minute  performance  of  an 
investigational  algorithm  during  the  initial  16  min  of 
prehospital  transportation,  including  the  temporal 
variation  of  decision  changes  in  the  same  patients 
and  the  fraction  of  total  patients  affected  by  some  of 
these  changes.  At  least  in  our  application,  the  addi¬ 
tional  statistics  provide  information  beyond  standard 
test  characteristics,  perhaps  in  part  because  we  exam¬ 
ined  data  measured  soon  after  traumatic  injury. 
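The two consistency statistics described above can be illustrated with a minimal sketch. It assumes each patient has a minute-by-minute sequence of binary classifier outputs (1 = alarm, 0 = no alarm) and a single post hoc ground-truth label. The function names, the example data, and the choice of all patients as the denominator are assumptions made for illustration, not details taken from the study.

```python
from typing import Dict, List


def count_decision_changes(decisions: List[int]) -> int:
    """Count how often consecutive outputs differ for one patient."""
    return sum(1 for prev, cur in zip(decisions, decisions[1:]) if prev != cur)


def fraction_with_false_alarm(outputs: Dict[str, List[int]],
                              truth: Dict[str, int]) -> float:
    """Fraction of all patients who are truly negative yet alarmed at least once."""
    affected = sum(
        1 for pid, seq in outputs.items()
        if truth[pid] == 0 and any(d == 1 for d in seq)
    )
    return affected / len(outputs)


# Hypothetical minute-by-minute outputs for four patients.
outputs = {
    "A": [0, 1, 0, 0, 1, 1],  # truly positive, unstable output
    "B": [0, 0, 1, 0, 0, 0],  # truly negative, one transient false alarm
    "C": [0, 0, 0, 0, 0, 0],  # truly negative, stable output
    "D": [1, 1, 1, 1, 1, 1],  # truly positive, stable output
}
truth = {"A": 1, "B": 0, "C": 0, "D": 1}

total_changes = sum(count_decision_changes(seq) for seq in outputs.values())
false_alarm_fraction = fraction_with_false_alarm(outputs, truth)
print(total_changes, false_alarm_fraction)
```

In this toy example, patients A and B end with the correct output, so final sensitivity and specificity would not reveal that their outputs changed along the way.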

Pre-  and  postprocessing  of  time  series  alters  per¬ 
formance  of  an  automated  continual  classifier.  Pre- 
and  postprocessing  of  time-series  data  is  appropriate 
for  removing  noise  that  occurs  over  faster  time  scales 
than  the  process  of  interest,  thus  enhancing  the 
underlying  signal.  In  this  study,  the  narrow  2-min 
moving  window  caused  a  large  number  of  patients 
to  trigger  false  alarms  (24%  more  than  the  time-averag¬ 
ing  approach  and  93%  more  than  the  SPRT  approach). 
Failure  of  developers  of  monitoring  algorithms  to 
explicitly  consider  classifier  output  stability,  or  con¬ 
sistency,  through  time  will  presumably  exacerbate 
the  well-described  problem  of  false  alarms  in  medical 
monitoring  systems1-4  and  will  likely  decrease  the 
incentive  for  caregivers  to  adopt  novel  decision-sup¬ 
port  technologies.  Conversely,  excessively  stable  clas¬ 
sifiers  are  also  problematic,  causing  unacceptable 
latency  when  a  patient’s  state  does  change.  The 
challenge  is  to  optimize  the  tradeoffs  between  classifier 
accuracy,  consistency,  and  latency. 

Consider  time  averaging.  As  long  as  the  noise  in 
the  time  series  has  no  major  bias,  this  is  a  practical 
technique  for  filtering  out  measurement  error  and 
transient  physiological  events.  For  a  monitoring  algo¬ 
rithm,  the  time-averaging  window  should  be  shorter 
than  the  onset  time  of  the  disease  of  interest.  In  other 
words,  time  averaging  over  15  min  may  be  useful 
when  seeking  hemorrhage  physiology,  although 
time  averaging  over  60  min  might  be  too  large  a  win¬ 
dow,  causing  unacceptable  latency  to  the  detection  of 
hemorrhage  physiology  that  can  progress  in  less  than 
an  hour.  In  this  report,  the  time-averaging  method 
improved  decision  consistency  (66%  fewer  decision 
changes)  and  reduced  the  number  of  patients  affected 
by  false  alarms  (20%  fewer)  compared  with  the  simple 
2-min  moving  window  method. 
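The contrast between a short moving window and time averaging can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: time averaging is taken to mean a running average of all samples since the start of monitoring, a single vital sign with a fixed threshold stands in for the multivariate classifier, and the sample values are hypothetical.

```python
from typing import List


def moving_window_mean(samples: List[float], window: int) -> List[float]:
    """Mean of the most recent `window` samples at each time step."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def cumulative_mean(samples: List[float]) -> List[float]:
    """Mean of all samples seen so far at each time step."""
    out, total = [], 0.0
    for n, x in enumerate(samples, start=1):
        total += x
        out.append(total / n)
    return out


def classify(values: List[float], threshold: float) -> List[int]:
    """1 (alarm) when the smoothed value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def count_changes(decisions: List[int]) -> int:
    """Number of times the output differs from the previous time step."""
    return sum(1 for a, b in zip(decisions, decisions[1:]) if a != b)


# Hypothetical heart-rate samples (one per minute) with two transient spikes.
heart_rate = [85, 86, 112, 110, 84, 85, 86, 111, 113, 85]
threshold = 100.0  # assumed alarm threshold

window_decisions = classify(moving_window_mean(heart_rate, window=2), threshold)
average_decisions = classify(cumulative_mean(heart_rate), threshold)
print(count_changes(window_decisions), count_changes(average_decisions))
```

Here the 2-sample window follows each spike across the threshold, while the running average stays below it. The same smoothing would also delay detection of a sustained rise, which is the latency cost discussed above.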

A  prior  report  corroborates  this  principle:  that  it  is 
often  possible  to  reduce  false  alarms  at  the  expense  of 
clinically  acceptable  latency.  In  monitoring  children 
at  home  by  pulse  oximetry,  Gelinas  and  others27  suggested 
that  the  rate  of  hypoxia  alarms  (SpO2  <  85%) 
could  be  reduced  from  3.6  to  0.2  alarms  per  night  with¬ 
out  missing  any  clinically  significant  events,  simply  by 
requiring  a  10-s  duration  of  hypoxia  (rather  than  alarm¬ 
ing  the  instant  that  the  hypoxia  threshold  was  met). 
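A duration requirement of this kind can be sketched in a few lines. The example assumes one SpO2 reading per second and interprets the 10-s requirement as 10 consecutive readings below threshold; the function name and the sample trace are hypothetical.

```python
from typing import List


def alarm_onsets(spo2: List[float], threshold: float = 85.0,
                 min_duration: int = 10) -> List[int]:
    """Return sample indices at which an alarm fires.

    An alarm fires once per episode, at the moment the reading has been
    below `threshold` for `min_duration` consecutive samples.
    """
    onsets = []
    run = 0  # length of the current below-threshold run
    for t, value in enumerate(spo2):
        if value < threshold:
            run += 1
            if run == min_duration:
                onsets.append(t)
        else:
            run = 0
    return onsets


# Hypothetical trace at an assumed 1 Hz: a 3-s dip, then a 12-s dip.
trace = [95.0] * 20 + [80.0] * 3 + [95.0] * 10 + [80.0] * 12 + [95.0] * 5

instant = alarm_onsets(trace, min_duration=1)     # alarm the instant threshold is met
persistent = alarm_onsets(trace, min_duration=10) # require 10 s below threshold
print(instant, persistent)
```

The brief dip triggers only the instantaneous rule. The sustained dip triggers both, but the persistent rule fires 9 s later, which is the latency accepted in exchange for fewer alarms.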

The  SPRT:  a  classic  technique  that  can  improve 
temporal  consistency  during  continual  monitoring. 
One  classic  application  of  the  SPRT  is  for  the  evalu¬ 
ation  of  a  shipment  of  manufactured  components. 
Components  are  measured  1  by  1  until  a  SPRT  deci¬ 
sion  is  rendered  that  the  set  of  components  is  within 
(or  outside  of)  the  acceptable  tolerances.  Our  investi¬ 
gational  algorithm  is  analogous  in  that  measurements 
were  taken  repeatedly  from  1  trauma  patient,  and  the 
SPRT  was  used  to  decide  whether  the  patient  was 
within  (or  outside  of)  the  range  of  vital  signs  typical 
of  patients  with  traumatic  injury  and  blood  loss.  Of 
course,  given  a  shipment  of  components,  individual 
measurements  are  statistically  independent,  while 
there  is  temporal  correlation  when  measurements 
are  repeated  in  the  same  patient.  Regardless,  our  find¬ 
ings  suggest  that  the  SPRT  is  suitable  for  improving 
the  consistency  of  the  investigational  classifier  based 
on  continual  physiological  data. 
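For readers unfamiliar with the technique, a minimal sketch of Wald's classic SPRT is given below. It is not the investigational algorithm: it assumes independent Gaussian observations with known means under each hypothesis and a common standard deviation, and it stops at the first decision rather than continuing to monitor. The function name, parameters, and sample values are hypothetical.

```python
import math
from typing import List


def sprt_gaussian(samples: List[float], mu0: float, mu1: float, sigma: float,
                  alpha: float = 0.05, beta: float = 0.05) -> List[str]:
    """Return the SPRT output after each sample: 'undecided', 'H0', or 'H1'."""
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0                              # cumulative log-likelihood ratio
    decision = "undecided"
    outputs = []
    for x in samples:
        if decision == "undecided":
            # Log-likelihood ratio of one Gaussian sample, H1 versus H0.
            llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma ** 2
            if llr >= upper:
                decision = "H1"
            elif llr <= lower:
                decision = "H0"
        outputs.append(decision)
    return outputs


# Hypothetical standardized measurements drawn near the H1 mean.
readings = [1.2, 0.4, 1.5, 0.9, 1.1, 1.3, 0.8]
outputs = sprt_gaussian(readings, mu0=0.0, mu1=1.0, sigma=1.0)
print(outputs)
```

The output remains "undecided" until enough evidence accumulates, which is the source of both the improved consistency and the added latency. Because repeated measurements from one patient are temporally correlated, the independence assumption in this sketch holds only approximately in the monitoring setting.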

In  the  medical  area,  the  SPRT  has  been  previously 
applied  to  the  performance  monitoring  of  clinical 
teams26-30  (to  continually  monitor  the  surgical  out¬ 
come  rate  and  ensure  it  does  not  deviate  from  the 
expected  success  rate),  routine  surveillance  of  drug 
safety31  (to  continually  monitor  whether  a  new  vaccine 
is  safe  over  a  period  of  time),  and  determination 
of  early  stopping  criteria  of  clinical  trials32,33  (to  allow 
the  trial  to  be  stopped  as  soon  as  the  information  accu¬ 
mulated  is  considered  sufficient  to  reach  a  conclu¬ 
sion).  Our  results  demonstrated  that  the  SPRT  may 
be  effective  for  continual  physiological  monitoring, 
in  the  reduction  of  false-alarm-affected  patients  (36% 
fewer  patients  than  the  time-averaging  method)  and 
overall  decision  changes  (75%  fewer  decision 
changes).  The  tradeoff  was  the  occurrence  of  some 
decision  latency  because,  unlike  the  other  investiga¬ 
tive  methods,  the  SPRT  can  yield  an  “undecided”  out¬ 
put  (see  Figure  2).  Indeed,  for  several  cases  (3%  of  the 
total),  there  was  never  a  diagnostic  decision  generated 
when  applying  the  SPRT.  For  applications  in  which 
such  a  tradeoff  is  acceptable,  the  SPRT  is  optimal  in 
the  sense  that,  mathematically,  it  guarantees  the  small¬ 
est  number  of  samples  to  achieve  a  decision  for  given 
false-positive  and  false-negative  probabilities.5  The 
performance  of  the  SPRT  depends  on  the  selected 
nominal  probabilities  α  and  β,  which  can  be  set  either 
arbitrarily  or  by  optimizing  a  cost  function  during 
classifier  training.  Properly  chosen  α  and  β  may 
improve  the  sensitivity  and  specificity,  and  decrease 
the  cumulative  incidences  of  decision  changes,  with 
acceptable  final  unresolved  decisions.  However, 
improperly  chosen  α  and  β  may  significantly  degrade 
the  sensitivity  or  the  specificity.  Moreover,  when 
we  first  attempted  to  optimize  the  SPRT  with  a  cost  function 
customized  wholly  to  yield  small  false-positive  (α) 
and  false-negative  (β)  probabilities,  we  improved  the  final 
accuracy  but  simultaneously  increased  the  unresolved 
decisions  to  40%  on  the  testing  data.  In  the  end,  the 
cost  function  defined  in  equation  3  provided  a  simple 
yet  effective  tool  to  balance  accuracy,  consistency,  and 
latency. 
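The link between small nominal error probabilities and unresolved decisions can be seen from the decision boundaries in Wald's classic formulation. The symbols Λ_n, A, B, f_0, and f_1 are notation introduced here for illustration. These are the standard approximate boundaries and may differ from the exact form used in the investigational algorithm.

```latex
\[
\Lambda_n = \sum_{i=1}^{n} \log\frac{f_1(x_i)}{f_0(x_i)}, \qquad
A = \log\frac{1-\beta}{\alpha}, \qquad
B = \log\frac{\beta}{1-\alpha}
\]
\[
\text{decide } H_1 \text{ if } \Lambda_n \ge A, \quad
\text{decide } H_0 \text{ if } \Lambda_n \le B, \quad
\text{otherwise remain undecided.}
\]
```

With α = β = 0.05, the boundaries are approximately ±2.94; with α = β = 0.001, they are approximately ±6.91. The undecided band more than doubles in width, so more samples are needed before a decision is reached. This is consistent with the increase in unresolved decisions observed when the cost function favored small α and β alone.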

This  tradeoff  between  latency  and  consistency  may 
limit  the  application  of  the  SPRT  in  the  detection  of  con¬ 
ditions  that  involve  an  imminent  threat  to  life,  e.g.,  car¬ 
diac  tachyarrhythmia.  However,  in  the  monitoring  of 
early  disease  states,  when  some  latency  is  acceptable, 
e.g.,  early  hemorrhage  detection,12  sepsis  detection,25,34 
or  other  early  warning  functionality,23,24  we  suggest 
that  the  SPRT  may  provide  a  means  to  improve  classifier 
stability  and  to  reduce  false  alarms,  without  any  neces¬ 
sary  loss  in  decision  accuracy. 

Identification  of  traumatic  injury  with  blood  loss 
via  continual  physiological  monitoring.  The  potential 
usefulness  of  the  diagnostic  classifier  described  in  this 
report  is  not  the  focus  of  this  study,  and  an  assessment 
of  potential  clinical  value  must  be  tempered  by  the  fact 
that  the  analysis  is  retrospective,  based  on  post  hoc 
classification  as  to  whether  each  subject  had  traumatic 
injury  with  blood  loss.  Having  said  that,  we  believe  that 
there  is  potential  clinical  value  to  the  methodological 
application  of  conventional  and  commonsense  analysis 
techniques  to  standard  vital-sign  data,  e.g.,  noise  rejec¬ 
tion,  time  averaging,  and  multivariate  classification.  We 
previously  found  that  automated  techniques  are  diag¬ 
nostically  equivalent  to  prehospital  severity  scores 
based  on  medics’  documentation.15  In  this  case,  we 
focused  on  the  identification  of  hemorrhage  because 
blood  loss  is  1  of  the  2  primary  reasons  why  trauma 
patients  die,35,36  but  in  many  cases  it  can  be  treated 
effectively  with  blood  transfusion  and  surgical  hemor¬ 
rhage  control.  We  speculate  that  formal  quantitative 
analysis  of  continual  vital  signs  may  be  able  to  supple¬ 
ment  today’s  convention,  which  relies  on  informal  cli¬ 
nician  judgments  to  integrate  vital-sign  data  with  other 
important  clinical  data.  For  instance,  automated  algo¬ 
rithms  during  prehospital  care  could  be  useful  for  triage 
and  to  aid  the  receiving  hospital  to  efficiently  mobilize 
proper  resources,  such  as  surgical  teams  and  units  of 
blood.  Similar  techniques  could  identify  hospitalized 
patients  who  suffer  unexpected  episodes  of  blood  loss 
during  convalescence,  e.g.,  early  warning  systems. 
However,  actual  performance  and  clinical  usefulness 
must  be  prospectively  assessed,  and  the  optimal 
approach  to  decision  support  for  trauma  patients  (e.g., 
attempt  to  identify  any  patients  with  traumatic  injury 
and  blood  loss  v.  attempt  to  identify  patients  with 
uncontrolled,  ongoing  blood  loss)  involves  open  ques¬ 
tions  that  are  not  addressed  in  this  analysis. 


CONCLUSION 

Over  time,  all  3  methods  converged  to  demonstrate 
very  similar  diagnostic  accuracy  (i.e.,  sensitivity  and 
specificity).  However,  their  consistency  was  signifi¬ 
cantly  different.  The  SPRT  significantly  reduced  the 
total  number  of  patients  affected  by  false  alarms,  but 
with  significantly  greater  latency,  compared  with 
the  moving  window  method  and  the  time-averaging 
method.  Time  averaging  showed  significantly  fewer 
patients  affected  by  false  alarms  compared  with  mov¬ 
ing  window,  and  without  latency.  These  findings 
highlight  how  there  are  continual  monitoring  appli¬ 
cations  for  which  the  proposed  test  characteristics 
provide  additional,  useful  information.  Metrics  of 
consistency  and  latency  can  demonstrate  additional 
properties  that  are  likely  relevant  to  clinical  practice. 